Sensor network system for acquiring high quality speech signals and communication method therefor

ABSTRACT

A sensor network system including node devices connected in a network via predetermined propagation paths collects data measured at each node device to be aggregated into one base station via a time-synchronized sensor network system. The base station calculates a position of the signal source based on the angle estimation value of the signal from each node device and position information thereof, designates a node device located nearest to the signal source as a cluster head node device, and transmits information of the position of the signal source and the designated cluster head node device to each node device, to cluster each node device located within the number of hops from the cluster head node device as a node device belonging to each cluster. Each node device performs an emphasizing process on the received signal from the signal source, and transmits an emphasized signal to the base station.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a sensor network system, such as a microphone array network system, which is provided for acquiring speech of high sound quality, and to a communication method therefor.

2. Description of the Related Art

Conventionally, in an application system that utilizes vocal sound (e.g., an audio teleconference system in which a plurality of microphones are connected, a speech recognition robot system, or a system having various speech interfaces), various speech processing practices such as speech source localization, speech source separation, noise cancellation, and echo cancellation are performed to utilize the vocal sound with a high sound quality. In particular, microphone arrays mainly intended for speech source localization and speech source separation are broadly researched for the purpose of acquiring a vocal sound with a high sound quality. In this case, the speech source localization specifies the direction and position of a speech source from sound arrival time differences, and the speech source separation extracts a specific speech source in a specific direction by erasing sound sources that become noise, utilizing the results of the speech source localization.

It is known that speech processing using microphone arrays normally improves its performance in noise processing and the like as the number of microphones increases. Moreover, in such speech processing, there are a number of speech source localization techniques using the position information of a speech source (See, for example, Non-Patent Document 1). The speech processing becomes more effective as the results of the speech source localization become more accurate. In other words, it is required to concurrently improve the accuracy of the speech source localization and the noise cancellation intended for higher sound quality by increasing the number of microphones.

In a speech source localization method using a conventional large-scale microphone array, the positional range of a speech source is divided into mesh-shaped positional ranges, and the speech source positions are stochastically calculated for the respective intervals. For this calculation, it has been the practice to collect all speech data in a speech processing server, such as a workstation, in one place and to collectively process all the speech data to estimate the position of the speech source (See, for example, Non-Patent Document 2). In the case of the collective processing of all speech data as described above, the signal wiring length and communication traffic between the microphones for vocal sound collection and the speech processing server, and the calculation amount in the speech processing server, have been vast. There is such a problem that the microphones cannot be increased in number due to the following:

(a) the increase in the wiring length, the communication traffic, and the calculation amount in the speech processing server; and

(b) such a physical limitation that a large number of A/D converters cannot be arranged in one place at the speech processing server.

Moreover, there is also a problem that noise occurs due to the increase in the signal wiring length. Therefore, it has been difficult to increase the number of microphones for the purpose of higher sound quality.

As a method for making improvements concerning the above problems, there has been known a speech processing system with a microphone array in which a plurality of microphones are grouped into small arrays and they are aggregated (See, for example, Non-Patent Document 3). However, even in such a speech processing system, the speech data of all the microphones obtained in the small arrays are aggregated into the speech server in one place via a network, and therefore, this leads to a problem of increase in the communication traffic of the network. Moreover, there is such a problem that a speech processing delay occurs in accordance with the increase in the communication data amount and the communication traffic amount.

Moreover, in order to satisfy demands for sound pickup in a ubiquitous system and a television conference system in the future, a greater number of microphones are necessary (See, for example, Patent Document 1). However, in the current network system with a microphone array as described above, the speech data obtained by the microphone array is merely transmitted to the server as it is. We have found no system in which node devices of a microphone array mutually exchange position information of the speech source to reduce the calculation amount in the entire system and reduce the communication traffic of the network. Therefore, a system architecture that reduces the calculation amount of the entire system and suppresses the communication traffic of the network becomes important when assuming an increase in the scale of the microphone array network system.

As described above, it has been demanded to improve the speech source localization accuracy by using a number of microphone arrays while suppressing the communication traffic and the calculation amount in the speech processing server, and to effectively perform speech processing such as noise cancellation. Moreover, position measurement systems using a speech source have been proposed recently. For example, Patent Document 2 discloses computation of the position of an ultrasonic tag by using the ultrasonic tag and a microphone array. Further, Patent Document 3 discloses sound pickup by using a microphone array.

Prior art documents related to the present invention are as follows:

PATENT DOCUMENTS

-   Patent Document 1: Japanese patent laid-open publication No. JP 2008-113164 A;
-   Patent Document 2: Pamphlet of International Publication No. WO 2008/026463 A1;
-   Patent Document 3: Japanese patent laid-open publication No. JP 2008-058342 A; and
-   Patent Document 4: Japanese patent laid-open publication No. JP 2008-099075 A.

NON-PATENT DOCUMENTS

-   Non-Patent Document 1: Ralph O. Schmidt, “Multiple emitter location and signal parameter estimation”, In Proceedings of IEEE Transactions on Antennas and Propagation, Vol. AP-34, No. 3, March 1986.
-   Non-Patent Document 2: Eugene Weinstein et al., “Loud: A 1020-node modular microphone array and beamformer for intelligent computing spaces”, MIT, MIT/LCS Technical Memo MIT-LCS-TM-642, April 2004.
-   Non-Patent Document 3: Alessio Brutti et al., “Classification of Acoustic Maps to Determine Speaker Position and Orientation from a Distributed Microphone Network”, In Proceedings of ICASSP, Vol. IV, pp. 493-496, April 2007.
-   Non-Patent Document 4: Wendi Rabiner Heinzelman et al., “Energy-Efficient Communication Protocol for Wireless Microsensor Networks”, Proceedings of the 33rd Hawaii International Conference on System Sciences, 2000, Vol. 8, pp. 1-10, January 2000.
-   Non-Patent Document 5: Vivek Katiyar et al., “A Survey on Clustering Algorithms for Heterogeneous Wireless Sensor Networks”, International Journal of Advanced Networking and Applications, Vol. 02, Issue 04, pp. 745-754, 2011.
-   Non-Patent Document 6: J. Benesty et al., “Springer Handbook of Speech Processing”, Springer, 50. Microphone arrays, pp. 1021-1041, 2008.
-   Non-Patent Document 7: Futoshi Asano et al., “Sound Source Localization and Signal Separation for Office Robot “Jijo-2””, Proceedings of the 1999 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, Taipei, Taiwan, R.O.C., pp. 243-248, August 1999.
-   Non-Patent Document 8: Miklos Maroti et al., “The Flooding Time Synchronization Protocol”, Proceedings of 2nd ACM SenSys, pp. 39-49, November 2004.
-   Non-Patent Document 9: Takashi Takeuchi et al., “Cross-Layer Design for Low-Power Wireless Sensor Node Using Wave Clock”, IEICE Transactions on Communications, Vol. E91-B, No. 11, pp. 3480-3488, November 2008.
-   Non-Patent Document 10: Maleq Khan et al., “Distributed Algorithms for Constructing Approximate Minimum Spanning Trees in Wireless Networks”, IEEE Transactions on Parallel and Distributed Systems, Vol. 20, No. 1, pp. 124-139, January 2009.
-   Non-Patent Document 11: Wei Ye et al., “Medium Access Control With Coordinated Adaptive Sleeping for Wireless Sensor Networks”, In Proceedings of IEEE/ACM Transactions on Networking, Vol. 12, No. 3, pp. 493-506, 2004.

However, the position measurement function of the GPS system and the WiFi system mounted on many mobile terminals has had such a problem that a positional relation between terminals at a short distance of tens of centimeters cannot be acquired even though a rough position on a map can be acquired.

For example, Non-Patent Document 4 discloses a communication protocol to perform wireless communications by efficiently using transmission energy in a wireless sensor network. Moreover, Non-Patent Document 5 discloses using a clustering technique for lengthening the lifetime of the sensor network as a method for reducing the energy consumption in a wireless sensor network.

However, the prior art clustering method, which is a technique limited to a network layer, considers neither the sensing object (application layer) nor the hardware configuration of node devices. This has led to such a problem that the prior art technique is not adapted to an application that needs to configure paths based on the actual physical signal source position.

SUMMARY OF THE INVENTION

An object of the present invention is to solve the aforementioned problems and provide a sensor network system capable of performing data aggregation more efficiently than in the prior art, remarkably reducing the network traffic and reducing the power consumption of the sensor node devices in a sensor network system of, for example, a microphone array network system, and a communication method therefor.

In order to achieve the aforementioned objective, according to one aspect of the present invention, there is provided a sensor network system including a plurality of node devices each having a sensor array and known position information. The node devices are connected with each other in a network via predetermined propagation paths by using a predetermined communication protocol, and the sensor network system collects data measured at each of the node devices so as to be aggregated into one base station by using a time-synchronized sensor network system. Each of the node devices includes a sensor array, a direction estimation processor part, and a communication processor part. The sensor array is configured to arrange a plurality of sensors in an array form. The direction estimation processor part operates when detecting a signal from a predetermined signal source received by the sensor array on the basis of the signal, to transmit a detected message to the base station and to estimate an arrival direction angle of the signal and transmit an angle estimation value to the base station, and is activated in response to an activation message at a time of detecting a signal received via a predetermined number of hops from other node devices to estimate an arrival direction angle of the signal and transmit an angle estimation value to the base station. The communication processor part performs an emphasizing process on a signal from a predetermined signal source received by the sensor array for each of the node devices belonging to a cluster designated by the base station in correspondence with the speech source, and transmits a signal that has undergone the emphasizing process to the base station. The base station calculates a position of the signal source on the basis of the angle estimation value of the signal from each of the node devices and position information of each of the node devices, designates a node device located nearest to the signal source as a cluster head node device, and transmits information of the position of the signal source and the designated cluster head node device to each of the node devices, thereby clustering each of the node devices located within the number of hops from the cluster head node device as a node device belonging to each cluster. Each of the node devices performs an emphasizing process on the signal from the predetermined signal source received by the sensor array for each of the node devices belonging to the cluster designated by the base station in correspondence with the speech source, and transmits the signal that has undergone the emphasizing process to the base station.
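As a rough illustration of the base-station step described above, the following Python sketch estimates a two-dimensional source position from per-node arrival angles and picks the nearest node device as the cluster head. The least-squares intersection of bearing lines, the shared global angle reference, the node layout, and the names `locate_source` and `choose_cluster_head` are all assumptions made for illustration; the text above does not prescribe this particular calculation.

```python
import math

def locate_source(positions, angles):
    """Least-squares intersection of bearing lines (illustrative assumption).

    positions: {node_id: (x, y)} known node positions
    angles:    {node_id: theta} arrival direction in radians, measured in a
               shared global frame (assumed, not stated in the text)
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for node_id, theta in angles.items():
        x, y = positions[node_id]
        # Unit normal of the bearing line through (x, y) with direction theta.
        nx, ny = -math.sin(theta), math.cos(theta)
        c = nx * x + ny * y
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        b1 += nx * c
        b2 += ny * c
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("bearing lines are parallel; position is undetermined")
    # Solve the 2x2 normal equations in closed form.
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

def choose_cluster_head(positions, source):
    """Return the id of the node nearest to the estimated source position."""
    sx, sy = source
    return min(positions, key=lambda n: math.hypot(positions[n][0] - sx,
                                                   positions[n][1] - sy))

# Hypothetical layout: three node devices observing a source at (1, 1).
positions = {"N1": (0.0, 0.0), "N2": (3.0, 0.0), "N3": (1.0, 3.0)}
angles = {n: math.atan2(1.0 - y, 1.0 - x) for n, (x, y) in positions.items()}
source = locate_source(positions, angles)
head = choose_cluster_head(positions, source)
```

With noise-free angles all bearing lines meet at one point, so the estimate coincides with the true position; with noisy angles the same code returns the point that minimizes the summed squared distance to the bearing lines.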

In the above-mentioned sensor network system, each of the node devices is set into a sleep mode before detecting the signal and before receiving the activation message, and power supply to circuits other than a circuit that detects the signal and a circuit that receives the activation message is stopped.

In addition, in the above-mentioned sensor network system, the sensor is a microphone to detect speech.

According to another aspect of the present invention, there is provided a communication method for use in a sensor network system including a plurality of node devices each having a sensor array and known position information. The node devices are connected with each other in a network via predetermined propagation paths by using a predetermined communication protocol, and the sensor network system collects data measured at each of the node devices so as to be aggregated into one base station by using a time-synchronized sensor network system. Each of the node devices includes a sensor array, a direction estimation processor part, and a communication processor part. The sensor array is configured to arrange a plurality of sensors in an array form. The direction estimation processor part operates when detecting a signal from a predetermined signal source received by the sensor array on the basis of the signal, to transmit a detected message to the base station and to estimate an arrival direction angle of the signal and transmit an angle estimation value to the base station, and is activated in response to an activation message at a time of detecting a signal received via a predetermined number of hops from other node devices to estimate an arrival direction angle of the signal and transmit an angle estimation value to the base station. The communication processor part performs an emphasizing process on a signal from a predetermined signal source received by the sensor array for each of the node devices belonging to a cluster designated by the base station in correspondence with the speech source, and transmits a signal that has undergone the emphasizing process to the base station. The communication method includes the following steps:

calculating by the base station a position of the signal source on the basis of the angle estimation value of the signal from each of the node devices and position information of each of the node devices, designating a node device located nearest to the signal source as a cluster head node device, and transmitting information of the position of the signal source and the designated cluster head node device to each of the node devices, thereby clustering each of the node devices located within the number of hops from the cluster head node device as a node device belonging to each cluster, and

performing an emphasizing process by each of the node devices on the signal from the predetermined signal source received by the sensor array for each of the node devices belonging to the cluster designated by the base station in correspondence with the speech source, and transmitting the signal that has undergone the emphasizing process to the base station.
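The clustering step, in which node devices within a given number of hops from the cluster head node device become cluster members, can be sketched as a breadth-first search over the network links. This is a minimal sketch under stated assumptions: the adjacency map `links`, the function name `cluster_members`, and the example topology are hypothetical, and the text does not state how hop counts are actually obtained.

```python
from collections import deque

def cluster_members(links, head, max_hops):
    """Return {node_id: hop_count} for nodes within max_hops of the head.

    links: {node_id: iterable of neighbouring node_ids} (assumed topology)
    """
    hops = {head: 0}
    queue = deque([head])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in links.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops

# Hypothetical chain N1 - N2 - N3 - N4 with N2 designated as cluster head.
links = {"N1": ["N2"], "N2": ["N1", "N3"], "N3": ["N2", "N4"], "N4": ["N3"]}
members = cluster_members(links, "N2", max_hops=1)
```

Here `members` contains N2 itself at hop 0 and its direct neighbours N1 and N3 at hop 1, while N4 lies outside the one-hop cluster.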

The above-mentioned communication method further includes a step of setting each of the node devices into a sleep mode before detecting the signal and before receiving the activation message, and stopping power supply to circuits other than a circuit that detects the signal and a circuit that receives the activation message.

In addition, in the above-mentioned communication method, the sensor is a microphone to detect speech.

Therefore, according to the sensor network system and the communication method therefor of the present invention, by configuring the network paths specialized for data aggregation coping with the physical arrangement of a plurality of signal sources by utilizing the signal of the object of sensing for the clustering, cluster head determination, and routing on the sensor network, redundant paths are reduced, and the efficiency of data aggregation can be improved at the same time. Moreover, by virtue of the reduced communication overhead for configuring the paths, the network traffic is reduced, and the operating time of the communication circuit of large power consumption can be reduced. Therefore, the data aggregation can be performed more efficiently, the network traffic can be remarkably reduced, and the power consumption of the sensor node device can be reduced in the sensor network system by comparison to the prior art.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects and features of the present invention will become clear from the following description taken in conjunction with the preferred embodiments thereof with reference to the accompanying drawings throughout which like parts are designated by like reference numerals, and in which:

FIG. 1 is a block diagram showing a detailed configuration of a node device that is used in a speech source localization system according to a first preferred embodiment and in a position measurement system according to a second preferred embodiment of the present invention;

FIG. 2 is a flow chart showing processing in a microphone array network system used in the system of FIG. 1;

FIG. 3 is a waveform chart showing speech activity detection (VAD) at zero-cross points used in the system of FIG. 1;

FIG. 4 is a block diagram showing a detail of a delay-sum circuit part used in the system of FIG. 1;

FIG. 5 is a plan view showing a basic principle of a plurality of distributedly arranged delay-sum circuit parts of FIG. 4;

FIG. 6 is a graph showing a time delay from a speech source indicative of operation in the system of FIG. 5;

FIG. 7 is an explanatory view showing a configuration of a speech source localization system of the first preferred embodiment;

FIG. 8 is an explanatory view for explaining two-dimensional speech source localization in the speech source localization system of FIG. 7;

FIG. 9 is an explanatory view for explaining three-dimensional speech source localization in the speech source localization system of FIG. 7;

FIG. 10 is a schematic view showing a configuration of a microphone array network system according to a first implemental example of the present invention;

FIG. 11 is a schematic view showing a configuration of a node device having the microphone array of FIG. 10;

FIG. 12 is a functional diagram showing functions of the microphone array network system of FIG. 7;

FIG. 13 is an explanatory view for explaining experiments of three-dimensional speech source localization accuracy in the microphone array network system of FIG. 7;

FIG. 14 is a graph showing measurement results indicating improvements in the three-dimensional speech source localization accuracy in the microphone array network system of FIG. 7;

FIG. 15 is a schematic view showing a configuration of a microphone array network system according to a second implemental example of the present invention;

FIG. 16 is an explanatory view for explaining the speech source localization system of the second implemental example of FIG. 15;

FIG. 17 is a block diagram showing a configuration of a network used in the position measurement system of the second preferred embodiment of the present invention;

FIG. 18A is a perspective view showing a method of flooding time synchronization protocol (FTSP) used in the position measurement system of FIG. 17;

FIG. 18B is a timing chart showing a condition of data propagation indicative of the method;

FIG. 19 is a graph showing time synchronization with linear interpolation used in the position measurement system of FIG. 17;

FIG. 20A is a first part of a timing chart showing a signal transmission procedure between tablets, and processes executed at the tablets in the position measurement system of FIG. 17;

FIG. 20B is a second part of the timing chart showing a signal transmission procedure between the tablets, and the processes executed at the tablets in the position measurement system of FIG. 17;

FIG. 21 is a plan view showing a method for measuring distances between the tablets from angle information measured at the tablets of the position measurement system of FIG. 17;

FIG. 22 is a block diagram showing a configuration of the node device of a data aggregation system for a microphone array network system according to a third preferred embodiment of the present invention;

FIG. 23 is a block diagram showing a detailed configuration of the data communication part 57a of FIG. 22;

FIG. 24 is a table showing a detailed configuration of a table memory in the parameter memory 57b of FIG. 23;

FIGS. 25A to 25D are schematic plan views showing processing operations of the data aggregation system of FIG. 22, in which FIG. 25A is a schematic plan view showing FTSP processing from the base station and routing (T11), FIG. 25B is a schematic plan view showing speech activity detection (VAD) and detection message transmission (T12), FIG. 25C is a schematic plan view showing wakeup message and clustering (T13), and FIG. 25D is a schematic plan view showing cluster selection and delay-sum processing (T14);

FIG. 26A is a timing chart showing a first part of the processing operation of the data aggregation system of FIG. 22;

FIG. 26B is a timing chart showing a second part of the processing operation of the data aggregation system of FIG. 22; and

FIG. 27 is a plan view showing a configuration of an implemental example of the data aggregation system of FIG. 22.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Preferred embodiments of the present invention will be described below with reference to the drawings. In the following preferred embodiments, like components are denoted by like reference numerals.

As described in the prior art, an independent distributed type routing algorithm is indispensable in a sensor network configured to include a number of node devices. A plurality of source origins of signals of the object of sensing exist in a sensing area, and routing using clustering is effective for configuring optimal paths for them. According to the preferred embodiments of the present invention, a sensor network system capable of efficiently performing data aggregation by using a speech source localization system in a sensor network system relevant to a microphone array network system intended for acquiring speech of high sound quality, and a communication method therefor, are described below.

First Preferred Embodiment

FIG. 1 is a block diagram showing a detailed configuration of a node device that is used in a speech source localization system according to a first preferred embodiment and also used in a position measurement system according to a second preferred embodiment of the present invention. The speech source localization system of the present preferred embodiment is configured by using, for example, a ubiquitous network system (UNS), and the speech source localization system is configured to be a large-scale microphone array speech processing system as a whole by connecting small-scale microphone arrays (sensor node devices) each having, for example, 16 microphones on a predetermined network. In this case, a microphone processor is mounted on each of the sensor node devices, and speech processing is performed in a distributed and cooperative manner.

Referring to FIG. 1, each sensor node device is configured to include the following:

(1) an AD converter circuit 51 connected to a plurality of sound pickup microphones 1;

(2) a speech estimation processor part (for voice activity detection, hereinafter referred to as a VAD processor part, and VAD is referred to as speech activity detection hereinafter) 52 connected to the AD converter circuit 51 to detect a speech signal;

(3) an SRAM (Static Random Access Memory) 54, which temporarily stores a speech signal or a speech signal including a sound signal or the like (the sound signal means a signal at an audio frequency of, for example, 500 Hz or an ultrasonic signal) that has been subjected to AD-conversion by the AD converter circuit 51;

(4) an SSL processor part 55, which executes speech source localization processing to estimate the position of a speech source for the digital data of a speech signal or the like outputted from the SRAM 54, and outputs the results to the SSS processor part 56;

(5) an SSS processor part 56, which executes a speech source separation process to extract a specific speech source for the digital data of the speech signal or the like outputted from the SRAM 54 and the SSL processor part 55, and collects speech data of high SNR obtained as the results of the process by transceiving the data to and from other node devices via a network interface circuit 57; and

(6) a network interface circuit 57, which configures a data communication part to transceive speech data, and is connected to other peripheral sensor node devices Nn (n=1, 2, . . . , N).

The sensor node devices Nn (n=0, 1, 2, . . . , N) have the same configuration as each other, and the sensor node device N0 of the base station can obtain speech data whose SNR is further improved by aggregating the speech data in the network. It is noted that the VAD processor part 52 and a power supply manager part 53 are used for the speech source localization of the first preferred embodiment, whereas they are not used in principle in the position estimation of the second preferred embodiment. Moreover, distance estimation described later is executed in, for example, the SSL processor part 55.

In the system configured as above, input speech data from the 16 microphones 1 is digitized by the AD converter circuit 51, and the information of the speech data is stored into the SRAM 54. Subsequently, the information is used for speech source localization and speech source separation. The speech processing including them is executed under the control of the power supply manager part 53, which saves standby power, and the VAD processor part 52. The speech processor part is turned off when no speech exists in the periphery of the microphone array, and this power management is fundamentally necessary because the large number of microphones 1 wastes much power when not in use.

FIG. 2 is a flow chart showing processing in the microphone array network system used in the system of FIG. 1.

Referring to FIG. 2, a speech is inputted from one microphone 1 (S1), and a detection process (S2) of a speech activity (VA) is executed. In this case, the number of zero-cross points is counted (S2a), and it is judged whether or not the speech activity (speech estimation) has been detected (S2b). When the speech activity is detected, the peripheral sub-arrays are set into a wakeup mode (S3), and the speeches of all the microphones 1 are inputted (S4). Then, in a speech source localizing process (S5), after performing direction estimation in each sub-array (S5a), communication of position information (S5b) and a speech source localizing process (S5c), a speech source separation process (S6) is performed. In this case, separation in the sub-array (S6a), communication of speech data (S6b) and further separation of the speech source (S6c) are executed, and the speech data is outputted (S7).

The distinguished features of the present system are as follows.

(1) In order to activate the entire node device, low-power speech activity detection is performed.

(2) For the speech source localization, the speech source is localized (auditorily localized).

(3) In order to reduce the sound noise level, the speech source separation process is performed.

Moreover, the sub-array node devices are mutually connected to support intercommunications. Therefore, the speech data obtained at the node devices can be collected to further improve the SNR of the speech source. In the present system, a number of microphone arrays are configured via interactions with the peripheral node devices. Therefore, calculation can be distributed among the node devices. The present system has scalability (extendability) in the aspect of the number of microphones. Moreover, each of the node devices executes preparatory processing for the picked-up speech data.

FIG. 3 is a waveform chart showing speech activity detection (VAD: voice activity detection) at the zero-cross points used in the system of FIG. 1.

The microphone array network of the present preferred embodiment is configured to include a number of microphones whose power consumption easily becomes tremendous. An intelligent microphone array system according to the present preferred embodiment is required to operate with a limited energy source in order to save power as far as possible. Since the speech processing unit and the microphone amplifier consume power to a certain extent even when the environment is quiet, speech processing with power saving is effective. Although the present inventor and others have proposed a low power consumption VAD hardware implementation to reduce the standby power of the sub-arrays in a conventional apparatus, a zero-cross algorithm for VAD is used in the present preferred embodiment. As apparent from FIG. 3, the speech signal crosses a trigger line that is a high trigger value or a low trigger value, and thereafter, the zero-cross point is located at the first intersection of the input signal and an offset line. The abundance ratio of the zero-cross points remarkably differs between a speech signal and a non-speech signal. The zero-cross VAD detects the speech by detecting this difference and outputting the first point and the last point of the speech interval. The only requirement is to capture the crossing points throughout the range from the trigger line to the offset line. At this time, no detailed speech signal needs to be detected, and the sampling frequency and the bit count can be consequently reduced.
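The trigger-line and offset-line rule described above can be sketched as a small counting routine. The following Python is only an illustrative sketch: the 10-bit mid-scale offset of 512, the trigger margin of 50, the decision threshold `min_crossings`, and the function names are assumptions chosen for the example and are not values given in the text.

```python
def count_zero_cross(samples, offset=512, trigger=50):
    """Count zero-cross points qualified by a preceding trigger-line crossing.

    A zero-cross point is counted at the first sample that reaches the offset
    line after the signal has gone beyond the high or low trigger line.
    offset and trigger are assumed example values.
    """
    high, low = offset + trigger, offset - trigger
    armed = 0   # +1: high trigger crossed, -1: low trigger crossed, 0: idle
    count = 0
    for s in samples:
        # First check whether an armed excursion has returned to the offset line.
        if armed == 1 and s <= offset:
            count += 1
            armed = 0
        elif armed == -1 and s >= offset:
            count += 1
            armed = 0
        # Then (re)arm if this sample lies beyond a trigger line.
        if s >= high:
            armed = 1
        elif s <= low:
            armed = -1
    return count

def is_speech(samples, min_crossings=2):
    """Hypothetical frame decision: enough qualified zero-cross points."""
    return count_zero_cross(samples) >= min_crossings

loud = [512, 600, 512, 400, 512, 600, 500]   # swings beyond both trigger lines
quiet = [512, 530, 500, 520, 505]            # stays inside the trigger band
loud_count = count_zero_cross(loud)          # 3
quiet_count = count_zero_cross(quiet)        # 0
```

Small fluctuations around the offset line never arm the detector, which is why the quiet frame yields no zero-cross points even though it crosses the offset line several times.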

According to the VAD of the present inventor and others, the sampling frequency can be reduced to 2 kHz, and the bit count per sample can be set to 10 bits. A single microphone is sufficient for detecting a signal, and the remaining 15 microphones are turned off. These values are sufficient for detecting human words, and in this case, the VAD implemented in a 0.18-μm CMOS process consumes only 3.49 μW of power.

By separating the low-power VAD processor part 52 from the speech processor part, the speech processor part (SSL processor part 55, SSS processor part 56, etc.) can be turned off by using the power supply manager part 53. Further, not all the VAD processor parts 52 of the node devices are required to operate; the VAD processor part 52 needs to be active in only a limited number of node devices in the system. When the VAD processor part 52 detects a speech signal, a processor relevant to the main signal starts execution, and the sampling frequency and the bit count are increased to sufficient values. It is noted that the parameters to determine the analog factors in the specifications of the AD converter circuit 51 can be changed in accordance with the specific application integrated in the system.

Next, a distributedly arranged speech capturing process is described below. FIG. 4 is a block diagram showing details of the delay-sum circuit part used in the system of FIG. 1. In order to acquire high-SNR speech data, the following two types of techniques have been proposed:

(1) a technique using geometrical position information; and

(2) a statistical technique that enhances the main speech source without using position information.

The system of the present preferred embodiment is premised on the node device positions in the network being known, and therefore a delay-sum beamforming algorithm, which is classified as a geometrical method (See, for example, the Non-Patent Document 6 and FIG. 4), was selected. This method produces less distortion than the statistical method. In addition, it requires only a small amount of calculation and is easily applied to distributed processing. A key point in collecting speech data from the distributed node devices is to align the speech phases between adjacent node devices; in this case, a phase mismatch (= time delay) is generated by the difference in the distance from the speech source to each of the node devices.

FIG. 5 is a plan view showing a basic principle of a plurality of the distributedly arranged delay-sum circuit parts of FIG. 4, and FIG. 6 is a graph showing a time delay from a speech source, indicative of the operation of the system of FIG. 5. In the present preferred embodiment, a two-layer algorithm is introduced to achieve the distributed delay-sum beamforming shown in FIG. 5. In a local layer, each of the node devices collects speech in 16 channels having local delays from the origin of the node device, and thereafter combines them into a single sound within the node device by using the basic delay-sum algorithm. Subsequently, in a global layer, the emphasized speech data, with a definite global delay that can be calculated from the position of an addition array, is transmitted to the adjacent node devices and finally aggregated into speech data that has a high SNR. A vocal packet includes a time stamp and speech data of 64 samples. The time stamp is given as T_(Packet)=T_(REC)−D_(Sender), where T_(REC) represents the timer value in the sending side node device when the speech data in the packet was recorded, and D_(Sender) represents the global delay at the origin of the sending side node device. In the receiving side node device, adjustment is performed by adding the global delay of the receiving side (D_(Receiver)) to T_(Packet) in the received time stamp, and the speech data is aggregated in the delay-sum form (FIG. 6). Each of the node devices thus transmits speech data in a single channel only, and high-SNR speech data is consequently acquired in the base station.
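The time-stamp handling of the global layer can be sketched as follows. This is only an illustration under stated assumptions: timer values are expressed in sample ticks, packets are plain dictionaries, and the names `make_packet` and `aggregate` are hypothetical. The actual vocal packet carries 64 samples; three samples are used here for brevity.

```python
from collections import defaultdict


def make_packet(t_rec, d_sender, samples):
    """Sending side: T_Packet = T_REC - D_Sender."""
    return {"t_packet": t_rec - d_sender, "samples": list(samples)}


def aggregate(packets, d_receiver):
    """Receiving side: shift each packet by D_Receiver and sum aligned samples.

    Returns a dict mapping aligned timer value -> summed amplitude.
    """
    acc = defaultdict(float)
    for packet in packets:
        start = packet["t_packet"] + d_receiver
        for i, sample in enumerate(packet["samples"]):
            acc[start + i] += sample
    return dict(acc)
```

For example, if one node device with a global delay of 3 ticks starts recording a pulse at timer value 102 and another with a global delay of 5 ticks starts recording the same pulse at timer value 104, both packets carry the time stamp 99, so the pulse samples fall on the same aligned tick and add coherently at the receiving side.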

FIG. 7 shows an explanatory view of the speech source localization of the present invention. Referring to FIG. 7, six node devices having microphone arrays and one speech processing server 20 are connected together via a network 10. The six node devices, each having a microphone array configured by arranging a plurality of microphones in an array form, are placed on four indoor wall surfaces. The direction of the speech source is estimated by a processor for speech pickup processing provided in each of the node devices, and the position of the speech source is specified by integrating the results in the speech processing server. By virtue of the data processing executed at each of the node devices, the communication traffic of the network can be reduced, and the calculation amount is distributed among the node devices.

Detailed descriptions are provided below separately for a case of two-dimensional speech source localization and a case of three-dimensional speech source localization. First of all, the two-dimensional speech source localization method of the present invention is described with reference to FIG. 8. Referring to FIG. 8, the node device 1 to the node device 3 estimate the directions of the speech source from the speech signals picked up by the respective microphone arrays. Each of the node devices calculates the response intensity of the MUSIC method in each direction, and estimates the direction in which the maximum value is taken to be the direction of the speech source. FIG. 8 shows a case where the node device 1 calculates the response intensity in the directions of −90 degrees to 90 degrees on the assumption that the perpendicular direction (frontward direction) of the array plane of the microphone array is 0 degrees, and the direction of θ1=−30 degrees is estimated to be the direction of the speech source. The node device 2 and the node device 3 likewise calculate the response intensity in each direction, and estimate the direction in which the maximum value is taken to be the direction of the speech source.

Then, weighting is performed for the intersections of the speech source direction estimation results of two node devices, i.e., between the node device 1 and the node device 2, between the node device 1 and the node device 3, and so on. In this case, the weight is determined on the basis of the maximum response intensity of the MUSIC method of each of the node devices (e.g., the product of the maximum response intensities of the two node devices). In FIG. 8, the scale of the weight is expressed by the balloon diameter at each intersection.

The balloons (positions and scales) that represent the plurality of obtained weights become speech source position candidates. Then, the speech source position is estimated by obtaining the barycenter of the plurality of obtained speech source position candidates. In the case of FIG. 8, obtaining the barycenter of the plurality of speech source position candidates means obtaining the weighted barycenter of the balloons (positions and scales) that represent the plurality of weights.
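The two-dimensional procedure above (pairwise intersections, weights given by the product of the maximum response intensities, and a weighted barycenter) may be sketched as follows. The sketch assumes that each node device reports its position, an absolute bearing in the room coordinate system (i.e., the array orientation has already been added to the estimated angle), and its maximum response intensity; the function names are hypothetical, and the direction estimates are treated as infinite lines for simplicity.

```python
import math
from itertools import combinations


def line_intersection(p, a_deg, q, b_deg):
    """Intersect two bearing lines; return None if they are (nearly) parallel."""
    d = (math.cos(math.radians(a_deg)), math.sin(math.radians(a_deg)))
    e = (math.cos(math.radians(b_deg)), math.sin(math.radians(b_deg)))
    cross = d[0] * e[1] - d[1] * e[0]
    if abs(cross) < 1e-12:
        return None
    # Solve p + t*d = q + s*e for t by taking the cross product with e.
    t = ((q[0] - p[0]) * e[1] - (q[1] - p[1]) * e[0]) / cross
    return (p[0] + t * d[0], p[1] + t * d[1])


def locate_2d(nodes):
    """nodes: list of (x, y, bearing_deg, peak_intensity).

    Returns the weighted barycenter of all pairwise intersections, or None.
    """
    sx = sy = sw = 0.0
    for (x1, y1, a1, w1), (x2, y2, a2, w2) in combinations(nodes, 2):
        point = line_intersection((x1, y1), a1, (x2, y2), a2)
        if point is None:
            continue
        w = w1 * w2  # weight: product of the two maximum response intensities
        sx += w * point[0]
        sy += w * point[1]
        sw += w
    return None if sw == 0 else (sx / sw, sy / sw)
```

With noise-free bearings all intersections coincide and the barycenter is the speech source position itself; with noisy bearings, intersections supported by strong responses pull the estimate more than weak ones.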

The three-dimensional speech source localization method of the present invention is described next with reference to FIG. 9. Referring to FIG. 9, the node device 1 to the node device 3 estimate the directions of the speech source from the speech signals picked up by the respective microphone arrays. Each of the node devices calculates the response intensity of the MUSIC method in the three-dimensional directions, and estimates the direction in which the maximum value is taken to be the direction of the speech source. FIG. 9 shows a case where the node device 1 calculates the response intensity in the rotation coordinate system about the perpendicular direction (frontward direction) of the array plane of the microphone array, and estimates the direction of the greatest intensity to be the direction of the speech source. The node device 2 and the node device 3 likewise calculate the response intensity in each direction, and estimate the direction in which the maximum value is taken to be the direction of the speech source.

Then, weighting is performed for the intersections of the speech source direction estimation results of two node devices, i.e., between the node device 1 and the node device 2, between the node device 1 and the node device 3, and so on. However, it is often the case that no intersection can be obtained in the three-dimensional case. Therefore, the intersection is obtained virtually on the line segment that connects the straight lines of the speech source direction estimation results of the two node devices at the shortest distance. It is noted that the weight is determined on the basis of the maximum response intensity of the MUSIC method at each of the node devices (e.g., the product of the maximum response intensities of the two node devices) in a manner similar to that of the two-dimensional case. In FIG. 9, the scale of the weight is expressed by the balloon diameter at each intersection in a manner similar to that of FIG. 8.

The balloons (positions and scales) that represent the plurality of obtained weights become speech source position candidates. Then, the speech source position is estimated by obtaining the barycenter of the plurality of obtained speech source position candidates. In the case of FIG. 9, obtaining the barycenter of the plurality of speech source position candidates means obtaining the weighted barycenter of the balloons (positions and scales) that represent the plurality of weights.
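The virtual intersection used in the three-dimensional case may be sketched as follows. The text only states that the intersection lies on the shortest line segment connecting the two direction lines; taking the midpoint of that segment is an assumption made here for illustration, and the function name is hypothetical.

```python
def virtual_intersection(p1, d1, p2, d2):
    """Midpoint of the shortest segment between lines p1 + t*d1 and p2 + s*d2.

    Points and directions are 3-tuples. Returns None for parallel lines.
    The midpoint choice is an assumption of this sketch.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = tuple(a - b for a, b in zip(p1, p2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # parallel lines: no unique shortest segment
    t = (b * e - c * d) / denom  # parameter of the closest point on line 1
    s = (a * e - b * d) / denom  # parameter of the closest point on line 2
    c1 = tuple(p + t * u for p, u in zip(p1, d1))
    c2 = tuple(p + s * u for p, u in zip(p2, d2))
    return tuple((u + v) / 2 for u, v in zip(c1, c2))
```

The resulting points can then be weighted and averaged in the same manner as in the two-dimensional case.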

First Implemental Example

One implemental example of the present invention is described. FIG. 10 is a schematic view of a microphone array network system of the first implemental example. FIG. 10 shows the system configuration in which node devices (1 a, 1 b, . . . , 1 n), each having a microphone array of 16 microphones arranged in an array form, and one speech processing server 20 are connected together via a network 10. In each of the node devices, as shown in FIG. 11, signal lines of the 16 microphones (m11, m12, . . . , m43, m44) arranged in an array form are connected to the input and output part (I/O part) 3 of the speech pickup processor part 2, and signals picked up from the microphones are inputted to the processor 4 of the speech pickup processor part 2. The processor 4 of the speech pickup processor part 2 estimates the direction of the speech source by executing processing of the algorithm of the MUSIC method using the inputted speech pickup signal.

Then, the processor 4 of the speech pickup processor part 2 transmits the speech source direction estimation results and the maximum response intensity to the speech processing server 20 shown in FIG. 7.

As described above, speech localization is performed distributedly in each of the node devices, the results are integrated in the speech processing server, and the aforementioned two-dimensional or three-dimensional localization processing is performed to estimate the position of the speech source.

FIG. 12 is a functional diagram showing functions of the microphone array network system of the first implemental example.

The node device having the microphone array subjects the signal from the microphone array to A/D conversion (step S11), and receives the speech pickup signal of each microphone as an input (step S13). By using the speech signals picked up from the microphones, the direction of the speech source is estimated by the processor mounted on the node device operating as the speech pickup processor part (step S15).

As shown in the graph of FIG. 12, the speech pickup processor part calculates the response intensity of the MUSIC method within the directional angles of −90 degrees to 90 degrees, with 0 degrees assumed to be the front (perpendicular direction) of the microphone array. Then, the direction in which the response intensity is highest is estimated to be the direction of the speech source. The speech pickup processor part is connected to the speech processing server via a network not shown in the figure, and the speech source direction estimation result (A) and the maximum response intensity (B) are exchanged as data by the node device (step S17) and sent to the speech processing server.

In the speech processing server, the data sent from the respective node devices are received (step S21). A plurality of speech source position candidates are calculated from the maximum response intensity of each of the node devices (step S23). Then, the position of the speech source is estimated on the basis of the speech source direction estimation result (A) and the maximum response intensity (B) (step S25).

The three-dimensional speech source localization accuracy is described below. FIG. 13 schematically shows the condition of an experiment on three-dimensional speech source localization accuracy. A room having a floor area of 12 meters×12 meters and a height of 3 meters is assumed. Sixteen sub-arrays configured by placing 16 microphone arrays of microphones arranged in an array form at equal intervals in four directions on the floor surface were assumed (Case A of 16 sub-arrays). Moreover, 41 sub-arrays configured by placing 16 microphone arrays at equal intervals in four directions on the floor surface, placing 16 microphone arrays at equal intervals in four directions on the ceiling surface, and placing nine microphone arrays at equal intervals on the floor surface were assumed (Case B of 41 sub-arrays). Moreover, 73 sub-arrays configured by placing 32 microphone arrays at equal intervals in four directions on the floor surface, placing 32 microphone arrays at equal intervals in four directions on the ceiling surface, and placing nine microphone arrays at equal intervals on the floor surface were assumed (Case C of 73 sub-arrays).

By using the three Cases A to C, the number of node devices and the dispersion of the speech source direction estimation errors of the node devices were changed, and the results of three-dimensional position estimation were compared with one another. In the three-dimensional position estimation, each of the node devices selects one communication partner at random, and obtains a virtual intersection.

The results of the measurement are shown in FIG. 14. The horizontal axis of FIG. 14 represents the dispersion (standard deviation) of the direction estimation error, and the vertical axis represents the position estimation error. It can be understood from the results of FIG. 14 that the accuracy of three-dimensional position estimation can be improved by increasing the number of node devices even if the estimation accuracy of the speech source direction is poor.

Second Implemental Example

Another implemental example of the present invention is described. FIG. 16 shows a schematic view of a microphone array network system according to a second implemental example. FIG. 17 shows a system configuration in which node devices (1 a, 1 b, 1 c), each having a microphone array in which 16 microphones are arranged in an array form, are connected via networks (11, 12). In the system of the second implemental example, no speech processing server exists, unlike the system configuration of the first implemental example. Moreover, as shown in FIG. 11 and in a manner similar to that of the first implemental example, signal lines of the 16 arrayed microphones (m11, m12, . . . , m43, m44) are connected to the I/O part 3 of the speech pickup processor part 2 at each of the node devices, and signals picked up from the microphones are inputted to the processor 4 of the speech pickup processor part 2. The processor 4 of the speech pickup processor part 2 estimates the direction of the speech source by executing processing of the algorithm of the MUSIC method.

Then, the processor 4 of the speech pickup processor part 2 exchanges data of the speech source direction estimation results with adjacent node devices and other node devices. The processor 4 of the speech pickup processor part 2 executes processing of the aforementioned two-dimensional localization or three-dimensional localization from the speech source direction estimation results and the maximum response intensities of the plurality of node devices including the self-node device, and estimates the position of the speech source.

Second Preferred Embodiment

FIG. 1 is a block diagram showing a detailed configuration of a node device used in a position measurement system according to the second preferred embodiment of the present invention. The position measurement system of the second preferred embodiment is characterized in measuring the position of a terminal more accurately than in the prior art by using the speech source localization system of the first preferred embodiment. The position measurement system of the present preferred embodiment is configured by employing, for example, a ubiquitous network system (UNS). By connecting small-scale microphone arrays (sensor node devices) each having, for example, 16 microphones via a predetermined network, a large-scale microphone array speech processing system is configured as a whole, thereby configuring a position measurement system. In this case, microphone processors are mounted on the respective sensor node devices, and speech processing is performed distributedly and cooperatively.

The sensor node device has the configuration of FIG. 1, and one example of the processing at each sensor node device is described hereinbelow. First of all, all the sensor node devices are in the sleep state in the initial stage. Among several sensor node devices located a certain distance apart, for example, one sensor node device transmits a sound signal for a predetermined time interval such as three seconds, and a sensor node device that detects the sound signal starts speech source direction estimation by multi-channel inputs. At the same time, a wakeup message is broadcast to other sensor node devices existing in the periphery, and the sensor node devices that have received the message also immediately start speech source direction estimation. After completing the speech source direction estimation, each sensor node device transmits an estimated result to the base station (the sensor node device connected to the server apparatus). The base station estimates the speech source position by using the collected direction estimation results of the sensor node devices, and broadcasts the results toward all the sensor node devices that have performed the speech source direction estimation. Next, each sensor node device performs speech source separation by using the position estimation results received from the base station. In a manner similar to that of the speech source localization, speech source separation is executed in two separate steps, internally at each sensor node device and among the sensor node devices. The speech data obtained at the sensor node devices are aggregated again in the base station via the network. The finally obtained high-SNR speech signal is transmitted from the base station to the server apparatus, and used for a predetermined application on the server apparatus.

FIG. 17 is a block diagram showing a configuration (concrete example) of a network used in the position measurement system of the present preferred embodiment. FIG. 18A is a perspective view showing a method of flooding time synchronization protocol (FTSP) used in the position measurement system of FIG. 17, and FIG. 18B is a timing chart showing a condition of data propagation indicative of the method. In addition, FIG. 19 is a graph showing time synchronization with linear interpolation used in the position measurement system of FIG. 17.

Referring to FIG. 17, sensor node devices N0 to N2 including the server apparatus SV are connected by way of, for example, UTP cables 60, and communications are performed by using 10BASE-T Ethernet (registered trademark). In the present implemental example, the sensor node devices N0 to N2 are connected with each other in a linear topology, where one sensor node device N0 operates as a base station and is connected to a server apparatus SV configured to include, for example, a personal computer. The known low power listening method is used for power consumption saving in the data-link layer of the communication system, and the known tiny diffusion method is used for the path formulation in the network layer.

In order to aggregate speech data among the sensor node devices N0 to N2 in the present preferred embodiment, it is required to synchronize time (timer value) at all the sensor node devices in the network. In the present preferred embodiment, a synchronization technique configured by adding linear interpolation to the known flooding time synchronization protocol (FTSP) is used. The FTSP achieves high-accuracy synchronization by only simple one-way communications. Although the synchronization accuracy of the FTSP is equal to or smaller than one microsecond between adjacent sensor node devices, there are variations in the quartz oscillators of the sensor node devices, and a time deviation disadvantageously occurs with a lapse of time after the synchronization process, as shown in FIG. 19. The deviation ranges from several microseconds to several tens of microseconds per second, and there is a concern that the performance of speech source separation might be degraded.

For the FTSP method shown in FIGS. 18A and 18B, see, for example, the Non-Patent Document 8.

In the proposed system of the present preferred embodiment, a time deviation between sensor node devices is stored at the time of time synchronization by the FTSP, and the time progress of the timer is adjusted by linear interpolation. Assuming that a reception time stamp at the first synchronization time is the timer value on the receiving side, by adjusting the time progress of the timer only in the period of a time stamp at the second synchronization time, the dispersion of the oscillation frequency can be corrected. With this arrangement, a time deviation after completing the synchronization can be suppressed within 0.17 microseconds per second. Even if the time synchronization by the FTSP occurs only once per minute, the time deviation between sensor node devices is suppressed within 10 microseconds by performing linear interpolation, and the performance of the speech source separation can be maintained.
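The effect of the linear interpolation can be illustrated by the following minimal sketch, which estimates the rate of the local timer from two synchronization points and uses it to convert later local timer values into reference time. The class name, the method name, and the use of microsecond units are assumptions for illustration; the sketch does not model the FTSP message exchange itself.

```python
class InterpolatedClock:
    """Corrects local timer drift using two (local, reference) sync points."""

    def __init__(self, local1, ref1, local2, ref2):
        # Reference time elapsed per unit of local time between the two syncs.
        self.rate = (ref2 - ref1) / (local2 - local1)
        self.local2 = local2
        self.ref2 = ref2

    def to_reference(self, local):
        """Convert a local timer value into estimated reference time."""
        return self.ref2 + (local - self.local2) * self.rate
```

For example, if a local oscillator runs fast by 20 microseconds per minute, two synchronizations one minute apart yield a rate slightly below 1, and a local reading taken one further minute later is mapped back onto the reference time instead of accumulating a further 20 microseconds of error.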

By storing a relative time (e.g., the elapsed time is defined as a relative time on the assumption that the time when the first sensor node device is turned on is zero) or the absolute time (e.g., the day, hour, minute and second on a calendar are set as the time), the time synchronization is performed among the sensor node devices by the aforementioned method. The time synchronization is used for measuring the accurate distance between sensor node devices as described later.

FIGS. 20A and 20B are timing charts showing a signal transmission procedure among tablets T1 to T4 and processes executed at the tablets T1 to T4 in the position measurement system of the second preferred embodiment. In this case, the tablets T1 to T4 each have, for example, the configuration of FIG. 1 and include the aforementioned sensor node devices. The following description describes a case where the tablet T1 is assumed to be a master, and the tablets T2 to T4 are assumed to be slaves. However, it is acceptable to arbitrarily set the number of tablets and to use any tablet as the master. Moreover, the sound signal may be audible sound waves, ultrasonic waves exceeding the frequencies in the audible range, or the like. In this case, regarding the sound signal, for example, the AD converter circuit 51 may additionally include a DA converter circuit and generate an omni-directional sound signal from one microphone 1 in response to the instruction of the SSL processor part 55, or may include an ultrasonic generator device and generate an ultrasonic omni-directional sound signal in response to the instruction of the SSL processor part 55. Further, the SSS processing need not be executed in FIGS. 20A and 20B.

Referring to FIG. 20A, first in step S31, the tablet T1 transmits an “SSL instruction signal of an instruction to prepare for receiving the sound signal with the microphone 1 and execute the SSL processing in response to the sound signal” to the tablets T2 to T4, and thereafter transmits a sound signal for a predetermined time of, for example, three seconds after a lapse of a predetermined time. The SSL instruction signal contains the transmission time information of the sound signal. The tablets T2 to T4 each calculate the distance between the tablet T1 and the self-tablet by calculating the difference between the time when the sound signal is received and the aforementioned transmission time information, i.e., the propagation time of the sound signal, and multiplying the known velocity of sound waves or ultrasonic waves by the calculated propagation time, and store the calculated results into a built-in memory. Moreover, the tablets T2 to T4 estimate and calculate the arrival direction of the sound signal by executing the speech source localizing process on the basis of the received sound signal using the MUSIC method (See, for example, the Non-Patent Document 7) described in detail in the first preferred embodiment, and store the calculated results into the built-in memory. That is, the distance from the tablet T1 to the self-tablet and an angle to the tablet T1 are estimated, calculated and stored by the SSL processing of the tablets T2 to T4.
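The distance calculation of step S31 amounts to multiplying the propagation time by the velocity of sound, as in the following minimal sketch. The function name is hypothetical, both time values are assumed to be taken from time-synchronized timers and expressed in seconds, and the default velocity of 343 m/s is an assumed value for air at about 20 degrees Celsius.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value: air at about 20 degrees C


def distance_from_time_of_flight(t_transmit_s, t_receive_s,
                                 speed=SPEED_OF_SOUND_M_PER_S):
    """Distance = sound velocity x (reception time - transmission time)."""
    propagation_time = t_receive_s - t_transmit_s
    if propagation_time < 0:
        raise ValueError("reception precedes transmission; check time sync")
    return speed * propagation_time
```

Under these assumptions, a timing error of 10 microseconds corresponds to a distance error of only about 3.4 millimeters, which indicates why the time synchronization accuracy described above matters for the distance measurement.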

Subsequently, in step S32, the tablet T1 transmits an “SSL instruction signal of an instruction to prepare for receiving with the microphone 1 and execute the SSL processing in response to the sound signal” to the tablets T3 and T4, and thereafter transmits a sound generation instruction signal to generate a sound signal to the tablet T2 after a lapse of a predetermined time. In this case, the tablet T1 is also brought into a standby state for the sound signal. The tablet T2 generates a sound signal in response to the sound generation instruction signal, and transmits the signal to the tablets T1, T3 and T4. The tablets T1, T3 and T4 estimate and calculate the arrival direction of the sound signal by executing the speech source localizing process on the basis of the received sound signal using the MUSIC method described in detail in the first preferred embodiment, and store the calculated results into the built-in memory. That is, an angle to the tablet T2 is estimated, calculated and stored by the SSL processing of the tablets T1, T3 and T4.

Further, in step S33, the tablet T1 transmits an “SSL instruction signal of an instruction to prepare for receiving with the microphone 1 and execute the SSL processing in response to the sound signal” to the tablets T2 and T4, and thereafter transmits a sound generation instruction signal to generate a sound signal to the tablet T3 after a lapse of a predetermined time. In this case, the tablet T1 is also brought into the standby state for the sound signal. The tablet T3 generates a sound signal in response to the sound generation instruction signal, and transmits the signal to the tablets T1, T2 and T4. The tablets T1, T2 and T4 estimate and calculate the arrival direction of the sound signal by executing the speech source localizing process on the basis of the received sound signal using the MUSIC method described in detail in the first preferred embodiment, and store the calculated results into the built-in memory. That is, an angle to the tablet T3 is estimated, calculated and stored by the SSL processing of the tablets T1, T2 and T4.

Furthermore, in step S34, the tablet T1 transmits an “SSL instruction signal of an instruction to prepare for receiving with the microphone 1 and execute the SSL processing in response to the sound signal” to the tablets T2 and T3, and thereafter transmits a sound generation instruction signal to generate a sound signal to the tablet T4 after a lapse of a predetermined time. In this case, the tablet T1 is also brought into the standby state for the sound signal. The tablet T4 generates a sound signal in response to the sound generation instruction signal, and transmits the signal to the tablets T1, T2 and T3. The tablets T1, T2 and T3 estimate and calculate the arrival direction of the sound signal by executing the speech source localizing process on the basis of the received sound signal using the MUSIC method described in detail in the first preferred embodiment, and store the calculated results into the built-in memory. That is, an angle to the tablet T4 is estimated, calculated and stored by the SSL processing of the tablets T1, T2 and T3.

Subsequently, in step S35 to perform data communications, the tablet T1 transmits an information reply instruction signal to the tablet T2. In response to this, the tablet T2 sends back to the tablet T1 an information reply signal that includes the distance between the tablets T1 and T2 calculated in step S31 and the angles when the tablets T1, T3 and T4 are viewed from the tablet T2 calculated in steps S31 to S34. Moreover, the tablet T1 transmits an information reply instruction signal to the tablet T3. In response to this, the tablet T3 sends back to the tablet T1 an information reply signal that includes the distance between the tablets T1 and T3 calculated in step S31 and the angles when the tablets T1, T2 and T4 are viewed from the tablet T3 calculated in steps S31 to S34. Further, the tablet T1 transmits an information reply instruction signal to the tablet T4. In response to this, the tablet T4 sends back to the tablet T1 an information reply signal that includes the distance between the tablets T1 and T4 calculated in step S31 and the angles when the tablets T1, T2 and T3 are viewed from the tablet T4 calculated in steps S31 to S34.

In the SSL general processing of the tablet T1, the tablet T1 calculates the distances between the tablets on the basis of the information collected as described above, in the manner described with reference to FIG. 21. Then, assuming, for example, that the tablet T1 (A of FIG. 21) is the origin of the XY coordinates, the tablet T1 calculates the XY coordinates of the other tablets T2 to T4 from the angle information obtained when each of the tablets T1 to T4 views the other tablets, by using the definitional equations of the known trigonometric functions. The XY coordinates of all the tablets T1 to T4 are thereby obtained. The coordinate values may be displayed on a display or outputted to a printer to be printed out. Moreover, it is acceptable to execute, for example, a predetermined application described in detail later by using the aforementioned coordinate values.

The SSL general processing of the tablet T1 may be performed by only the tablet T1 that is the master or performed by all the tablets T1 to T4. That is, at least one tablet or server apparatus (e.g., SV of FIG. 17) is required to execute the processing. Moreover, the SSL processing and the SSL general processing are executed by, for example, the SSL processor part 55 that is the control part.

FIG. 21 is a plan view showing a method for measuring distances between the tablets from the angle information measured at the tablets T1 to T4 (corresponding to A, B, C and D of FIG. 21) of the position measurement system of the second preferred embodiment. After all the tablets obtain the angle information, the server apparatus calculates the distance information of all the members. In the calculation of the distance information, as shown in FIG. 21, the lengths of all sides are obtained by the law of sines by using the values of the twelve angles and the length of any one side. Assuming that the length of AB is “d”, the length of AC is obtained by the following equation:

$AC = \frac{d\,\sin\left(\theta_{BA} - \theta_{BC}\right)}{\sin\left(\theta_{CB} - \theta_{CA}\right)}. \qquad (1)$

The lengths of the other sides can be obtained likewise by using the twelve angles and the length d. If each sensor node device can perform the aforementioned time synchronization, each sensor node device can obtain the distance from the difference between the speech start time and the arrival time. Although the number of node devices is four in FIG. 21, the present invention is not limited to this, and the distance between node devices can be obtained regardless of the number of node devices when the number of node devices is not smaller than two.
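Equation (1) can be checked numerically with the following minimal sketch. Here θ_XY is taken to mean the direction of tablet Y as viewed from tablet X, in radians; the function names are hypothetical, and the helper `bearing` exists only to generate test angles from known coordinates. Because only differences of angles measured at the same tablet appear in the equation, the orientation of each tablet's own array cancels out.

```python
import math


def side_ac(d, th_ba, th_bc, th_cb, th_ca):
    """Length of AC from the length d of AB by the law of sines, equation (1)."""
    return d * math.sin(th_ba - th_bc) / math.sin(th_cb - th_ca)


def bearing(src, dst):
    """Direction (radians) of point dst as viewed from point src."""
    return math.atan2(dst[1] - src[1], dst[0] - src[0])
```

For tablets at A=(0, 0), B=(4, 0) and C=(0, 3), the angle at B has a sine of 0.6 and the angle at C has a sine of 0.8, so that d=4 yields AC=4×0.6/0.8=3, in agreement with the coordinates.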

Although the two-dimensional position is estimated in the above second preferred embodiment, the present invention is not limited to this, and the three-dimensional position may be estimated by using a similar numerical expression.

Further, mounting of sensor node devices on a mobile terminal is described below. Regarding the practical use of the network system, it is conceivable not only to use sensor node devices fixed to a wall or a ceiling but also to mount them on a mobile terminal such as a robot. If the position of a person to be recognized can be estimated, it is possible to make a robot approach the person to be recognized for image collection of higher resolution and speech recognition of higher accuracy. Moreover, mobile terminals such as smart phones, which have recently become rapidly popular, have difficulties in acquiring the positional relations of the terminals at a short distance, although they can acquire their own current positions by using the GPS function. However, if the sensor node devices of the present network system are mounted on mobile terminals, it is possible to acquire the positional relations of terminals that are located at a short distance and cannot be discriminated by the GPS function or the like, by performing speech source localization with the terminals emitting sounds to one another. In the present preferred embodiment, two applications that utilize the positional relations of the terminals, namely a message exchange system and a multiplayer hockey game system, were implemented by using the Java programming language.

In the present preferred embodiment, a tablet personal computer to execute the application and prototype sensor node devices were connected together. A general-purpose OS is mounted as the OS of the tablet personal computer, which has USB 2.0 ports in two places and a wireless LAN function compliant with the IEEE 802.11b/g/n protocol, and a wireless network is configured by using the wireless LAN function. The microphones of the prototype sensor node device are arranged at intervals of 5 cm on four sides of the tablet personal computer, and a speech source localization module operates on the sensor node device (configured by an FPGA) to output localization results to the tablet personal computer. The position estimation accuracy in the present preferred embodiment is about several centimeters, which is remarkably higher than that of the prior art.

Third Preferred Embodiment

FIG. 22 is a block diagram showing a configuration of the node device of a data aggregation system for a microphone array network system according to the third preferred embodiment of the present invention, and FIG. 23 is a block diagram showing a detailed configuration of the data communication part 57 a of FIG. 22. FIG. 24 is a table showing a detailed configuration of a table memory in the parameter memory 57 b of FIG. 23. The data aggregation system of the third preferred embodiment is characterized in that a data aggregation system to efficiently aggregate speech data is configured by using the speech source localization system of the first preferred embodiment and the speech source localization system of the second preferred embodiment. Specifically, the communication method of the data aggregation system of the present preferred embodiment is used as a path formulation technique for a microphone array network system that handles a plurality of speech sources. The microphone array network is a technique to obtain a high-SNR speech signal by using a plurality of microphones. By adding a communication function to this technique and configuring a network, high-SNR speech data can be collected over a wide range. In the present preferred embodiment, by applying this communication method to a microphone array network, an optimal path formulation can be achieved for a plurality of speech sources, allowing sounds from the speech sources to be simultaneously collected. With this arrangement, for example, an audio teleconference system or the like that can cope with a plurality of speakers can be realized.

Referring to FIG. 22, each sensor node device is configured to include the following:

(1) an AD converter circuit 51 connected to a plurality of microphones 1 for speech pickup;

(2) a VAD processor part 52 connected to the AD converter circuit 51 to detect a speech signal;

(3) an SRAM 54, which temporarily stores speech data and the like, including a speech signal or a sound signal that has been subjected to AD conversion by the AD converter circuit 51;

(4) a delay-sum circuit part 58, which executes delay-sum processing for the speech data stored in the SRAM 54;

(5) a microprocessor unit (MPU) 50, which executes sound source localization processing to estimate the position of the speech source for the speech data outputted from the SRAM 54, subjects the results to speech source separation processing (SSS processing) and other processing, and collects high-SNR speech data obtained as the result of the processing by transceiving the data to and from other node devices via the data communication part 57 a;

(6) a timer and parameter memory 57 b, which includes a timer for time synchronization processing and a parameter memory to store parameters for data communications, and is connected to the data communication part 57 a and the MPU 50; and

(7) a data communication part 57 a, which configures a network interface circuit to transceive the speech data, control packets and so on, and is connected to other peripheral sensor node devices Nn (n=1, 2, . . . , N).

Although the sensor node devices Nn (n=1, 2, . . . , N) have mutually similar configurations, the sensor node device N0 of the base station can obtain speech data whose SNR is further improved by aggregating the speech data in the network.

Referring to FIG. 23, the data communication part 57 a of FIG. 22 is configured to include the following:

(1) a physical layer circuit part 61, which transceives speech data, control packets and so on, and is connected to other peripheral sensor node devices Nn (n=1, 2, . . . , N);

(2) an MAC processor part 62, which executes medium access control processing of speech data, control packets and so on, and is connected to the physical layer circuit part 61 and a time synchronizing part 63;

(3) a time synchronizing part 63, which executes time synchronization processing with other node devices, and is connected to the MAC processor part 62 and the timer and parameter memory 57 b;

(4) a receiving buffer 64, which temporarily stores the speech data or data of control packets and so on extracted by the MAC processor part 62, and outputs them to a header analyzer 66;

(5) a transmission buffer 65, which temporarily stores packets of speech data, control packets and so on generated by the packet generator part 68, and outputs them to the MAC processor part 62;

(6) a header analyzer 66, which receives the packet stored in the receiving buffer 64, analyzes the header of the packet, and outputs the results to a routing processor part 67 or to the VAD processor part 52, the delay-sum circuit part 58, and the MPU 50;

(7) a routing processor part 67, which determines routing as to which node device the packet is to be transmitted on the basis of analysis results from the header analyzer 66, and outputs the result to the packet generator part 68; and

(8) a packet generator part 68, which receives the speech data from the delay-sum circuit part 58 or the control data from the MPU 50, generates a predetermined packet on the basis of the routing instruction from the routing processor part 67, and outputs the packet to the MAC processor part 62 via the transmission buffer 65.

Moreover, referring to FIG. 24, the table memory in the parameter memory 57 b stores:

(1) self-node device information (node device ID and XY coordinates of the self-node device) that has been preparatorily determined and stored;

(2) path information (part 1) (transmission destination node device ID in the base station direction) acquired at time period T11;

(3) path information (part 2) (transmission destination node device ID of cluster CL1, transmission destination node device ID of cluster CL2, . . . , transmission destination node device ID of cluster CLN) acquired at time period T12; and

(4) cluster information (cluster head node device ID (cluster CL1), XY coordinates of speech source SS1, cluster head node device ID (cluster CL2), XY coordinates of speech source SS2, . . . , cluster head node device ID (cluster CLN), XY coordinates of speech source SSN) acquired at time periods T13 and T14.
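The four kinds of entries held in the table memory can be pictured as the following data structure. This is only an illustrative sketch: the class names, field names and the use of string keys such as "CL1" are assumptions, and the actual memory layout of FIG. 24 is not specified here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ClusterEntry:
    """Cluster information for one cluster (acquired at T13 and T14)."""
    head_node_id: int                 # cluster head node device ID
    source_xy: Tuple[float, float]    # XY coordinates of the speech source


@dataclass
class TableMemory:
    """Hypothetical in-memory view of the table memory of one node device."""
    node_id: int                      # (1) self-node device ID, stored in advance
    node_xy: Tuple[float, float]      # (1) XY coordinates of the self-node device
    next_hop_to_base: Optional[int] = None                            # (2) set at T11
    next_hop_to_cluster: Dict[str, int] = field(default_factory=dict)  # (3) set at T12
    clusters: Dict[str, ClusterEntry] = field(default_factory=dict)    # (4) set at T13/T14


# Example: node device 3 forwards toward the base station via node device 1.
table = TableMemory(node_id=3, node_xy=(2.0, 1.0))
table.next_hop_to_base = 1
table.next_hop_to_cluster["CL1"] = 6
table.clusters["CL1"] = ClusterEntry(head_node_id=6, source_xy=(2.5, 1.5))
```

Keeping the per-cluster entries in dictionaries keyed by cluster reflects that the number of clusters equals the number of detected speech sources and is not fixed in advance.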

It is assumed that the node devices Nn (n=1, 2, . . . , N) are located on a flat plane and have predetermined (known) coordinates in a predetermined XY coordinate system, and the position of each speech source is measured by position measurement processing.

FIGS. 25A to 25D are schematic plan views showing processing operations of the data aggregation system of FIG. 22, in which FIG. 25A is a schematic plan view showing FTSP processing from the base station and routing (T11), FIG. 25B is a schematic plan view showing speech activity detection (VAD) and detection message transmission (T12), FIG. 25C is a schematic plan view showing wakeup message and clustering (T13), and FIG. 25D is a schematic plan view showing cluster selection and delay-sum processing (T14). FIGS. 26A and 26B are timing charts showing a processing operation of the data aggregation system of FIG. 22.

In the operation example of FIGS. 25A to 25D, 26A and 26B, a one-hop cluster is formed for each of two speech sources SSA and SSB, and speech data are collected into the lower right-hand base station N0 (one node device of a plurality of node devices, indicated by the symbol of a circle in a square) while being aggregated and emphasized. First of all, the base station N0 of the microphone array sensor node devices performs inter-node device time synchronization and a broadcast for configuring the collection paths to the base station as a spanning tree, at intervals of, for example, 30 minutes, by using a predetermined FTSP and the NNT (Nearest Neighbor Tree) protocol simultaneously with a control packet CP (hollow arrow) (T11 of FIGS. 25A and 26A). The node devices (N1 to N8) other than the base station are subsequently set into a sleep mode until a speech input is detected, for power consumption saving. In the sleep mode, circuits other than the circuits including the AD converter circuit 51 and the VAD processor part 52 of FIG. 22 and the circuits for receiving a wakeup message (physical layer circuit part 61, MAC processor part 62, and the timer and parameter memory 57 b of data communication part 57 a) are not supplied with power, and the power consumption can be remarkably reduced.
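The path configuration toward the base station at T11 can be sketched as follows. Note that this sketch substitutes a plain breadth-first flood for the NNT protocol named in the text, and the four-node topology is hypothetical; it only illustrates how each node device could end up with one transmission destination node device ID in the base station direction.

```python
from collections import deque


def build_tree_to_base(adjacency, base):
    """Flood from the base station and record, for every node device,
    the neighbor from which the broadcast first arrived (its next hop
    toward the base station). The base station itself maps to None."""
    next_hop = {base: None}
    queue = deque([base])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency[node]):
            if neighbor not in next_hop:
                next_hop[neighbor] = node
                queue.append(neighbor)
    return next_hop


# Hypothetical topology: node 0 is the base station.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
paths = build_tree_to_base(adjacency, base=0)
```

In this example node 3 can reach the base station through node 1 or node 2 with the same hop count; the sketch simply keeps the first one encountered.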

Subsequently, when speech signals are generated from the two speech sources SSA and SSB, the node devices at which the VAD processor part 52 responds upon detecting a speech signal (i.e., utterance) (node devices N4 to N7, indicated by black circles in FIGS. 25A to 25D, 26A and 26B) transmit a detection message toward the base station N0 by using the control packet CP through the path of the configured spanning tree (T12 of FIGS. 25B and 26A), and broadcast a wakeup message (activation message) for activating the peripheral node devices by using the control packet CP (T13 of FIGS. 25C and 26A). It is noted at this time that the broadcast range covers a number of hops equivalent to the cluster distance to be configured (one hop in the case of the operation example of FIGS. 25A to 25D). Peripheral sleeping node devices (N1 to N3 and N8) are activated by the wakeup message, and a cluster that centers on the node device at which the VAD processor part 52 responded is formed at the same time.
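The hop-limited wakeup broadcast described above can be sketched as a breadth-first traversal that stops at the configured cluster distance. The function name and the line topology are hypothetical and serve only to show which node devices a wakeup message would reach.

```python
from collections import deque


def nodes_within_hops(adjacency, origin, max_hops):
    """Return the set of node devices reachable from origin within
    max_hops hops, including origin itself."""
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # the wakeup message is not relayed any further
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)


# Hypothetical line topology 0 - 1 - 2 - 3; node 1 detected speech.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
woken = nodes_within_hops(line, origin=1, max_hops=1)
```

With a one-hop range only the direct neighbors of the detecting node device are activated, so more distant node devices remain in the sleep mode.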

Subsequently, the node device at which the VAD processor part 52 responded and the node devices activated by the wakeup message (node devices N1 to N8 other than the base station N0 in the operation example) estimate the direction of the speech source by using the microphone array network system, and transmit the results to the base station N0. The path to be used at this time is the path of the spanning tree configured in FIG. 25A. The base station N0 geometrically estimates the absolute position of each speech source by using the method of the position measurement system of the second preferred embodiment on the basis of the speech source direction estimation results of the node devices and the known positions of the node devices. Further, the base station N0 designates the node device located nearest to the speech source among the originating node devices of the detection message as the cluster head node device, and broadcasts the designation result together with the estimated absolute position of the speech source to all the node devices (N1 to N8) of the entire network. If a plurality of speech sources SSA and SSB are estimated, cluster head node devices of the same number as the number of the speech sources are designated. By this operation, a cluster corresponding to the physical location of the speech source is formed, and a path from each cluster head node device to the base station N0 is configured (T14 of FIGS. 25D and 26B). In the operation example of FIGS. 25A to 25D, the node device N6 (indicated by a double circle in FIG. 25D) is designated as the cluster head node device of the speech source SSA, and the node devices belonging to the cluster are N3, N6 and N7 within one hop from N6. Moreover, the node device N4 (indicated by a double circle in FIG. 25D) is designated as the cluster head node device of the speech source SSB, and the node devices belonging to the cluster are N1, N3, N5 and N7 within one hop from N4. That is, the node devices located within the number of hops from the cluster head node devices N6 and N4 are clustered as the node devices belonging to the respective clusters. Then, the emphasizing process is performed on the basis of the speech data measured at the node devices belonging to each cluster, and the speech data that have undergone the emphasizing process are transmitted to the base station N0. By this operation, the speech data that have undergone the emphasizing process for each of the clusters corresponding to the speech sources SSA and SSB are transmitted to the base station N0 by using packets ESA and ESB. In this case, the packet ESA is the packet to transmit the speech data obtained by emphasizing the speech data from the speech source SSA, and the packet ESB is the packet to transmit the speech data obtained by emphasizing the speech data from the speech source SSB.
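The processing at the base station can be illustrated by the following sketch, which estimates a source position from direction estimates and then picks the nearest detecting node device as the cluster head. The least-squares intersection of bearing lines used here is a generic stand-in and is not claimed to be the numerical expression of the second preferred embodiment; the node coordinates, the convention that each angle is measured counterclockwise from the X axis, and all function names are assumptions.

```python
import math


def locate_source(nodes, bearings):
    """Least-squares intersection of bearing lines.

    nodes:    {node_id: (x, y)} known node device positions
    bearings: {node_id: angle in radians from the X axis}
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for node_id, theta in bearings.items():
        x, y = nodes[node_id]
        nx, ny = -math.sin(theta), math.cos(theta)  # unit normal of the bearing line
        c = nx * x + ny * y
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        b1 += nx * c
        b2 += ny * c
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("bearings are parallel; position is not determined")
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


def choose_cluster_head(nodes, detecting_ids, source_xy):
    """Return the detecting node device located nearest to the source."""
    return min(detecting_ids, key=lambda i: math.dist(nodes[i], source_xy))


# Hypothetical layout: the true source is at (1, 1).
nodes = {1: (0.0, 0.0), 2: (2.0, 0.0), 3: (1.0, 0.5)}
bearings = {1: math.atan2(1.0, 1.0), 2: math.atan2(1.0, -1.0)}
source = locate_source(nodes, bearings)
head = choose_cluster_head(nodes, [1, 2, 3], source)
```

Once the cluster head is known, the members of the cluster follow from the hop count, for example with a hop-limited traversal starting at the cluster head.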

FIG. 27 is a plan view showing a configuration of an implementation example of the data aggregation system of FIG. 22. The present inventor and others produced an experimental apparatus by using an FPGA (field programmable gate array) board to evaluate the network of the microphone array of the present preferred embodiment. The experimental apparatus has the functions of a VAD processor part, speech source localization, speech source separation, and a wired data communication module. The FPGA board of the experimental apparatus is configured to include 16-channel microphones 1, and the 16-channel microphones 1 are arranged in a grid form at intervals of 7.5 cm. The target of the present system is human speech having a frequency range of 30 Hz to 8 kHz, and therefore, the sampling frequency is set to 16 kHz.

In this case, the sub-arrays are connected together by using UTP cables. The 10BASE-T Ethernet (registered trademark) protocol is used as a physical layer. In the data-link layer, a protocol that adopts LPL (Low-Power-Listening) (See, for example, the Non-Patent Document 11) is used to reduce the power consumption.

The present inventor and others conducted experiments with three sub-arrays in FIG. 27 in order to confirm the performance of the proposed system. Referring to FIG. 27, three sub-arrays are arranged, and one sub-array 1 located in the center position is connected as a base station to the server PC. Regarding the network topology, a two-hop linear topology was used in this case to evaluate the multi-hop environment.

According to the signal waveforms measured after the time synchronization processing, the maximum time lag immediately after completion of the FTSP synchronization processing was 1 microsecond, and the maximum time lags between sub-arrays with linear interpolation and without linear interpolation were 10 microseconds and 900 microseconds, respectively, per minute.
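A lag of 900 microseconds per minute corresponds to a relative clock rate error of 15 ppm, which is why compensating the rate matters between synchronization rounds. The text does not detail how the linear interpolation is applied; the following is a generic sketch, with hypothetical names and values, that maps a local clock reading to the reference time from two synchronization points.

```python
def make_clock_mapper(local0, global0, local1, global1):
    """Build a function that converts a local clock reading to reference
    time by the straight line through two synchronization points."""
    if local1 == local0:
        raise ValueError("synchronization points must differ in local time")
    rate = (global1 - global0) / (local1 - local0)

    def to_global(local_t):
        return global0 + rate * (local_t - local0)

    return to_global


# Hypothetical synchronization points (seconds): the local clock runs
# slightly slow relative to the reference clock.
to_global = make_clock_mapper(0.0, 10.0, 100.0, 110.002)
corrected = to_global(50.0)
```

Readings taken after the second synchronization point are extrapolated along the same line, so any change in the drift rate since the last round remains as a residual error.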

Subsequently, referring to FIG. 27, the present inventor and others evaluated the capture of speech data by using the algorithm of the distributed delay-sum circuit part. In this case, as shown in FIG. 27, a signal source of a sine wave at 500 Hz and noise sources (sine waves at 300 Hz, 700 Hz and 1300 Hz) were used. According to the experimental results, the speech signal is enhanced, noises are reduced, and the SNR is improved as the number of microphones increases. Moreover, it was found that the noises at 300 Hz and 1300 Hz were drastically suppressed by 20 decibels without deteriorating the signal source (500 Hz) in the condition of 48 channels. On the other hand, the noise at 700 Hz is somewhat suppressed. This is presumably ascribed to the fact that interference was generated depending on the positions of the signal source and the noise source. Moreover, according to another experiment, it was found that the noise source at 700 Hz is scarcely suppressed around the positions of the noise source even in the case of 48 channels. This problem is presumably avoidable by increasing the number of node devices. Further, the present inventor and others also confirmed that speech capture could be achieved by using three sub-arrays.
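The principle behind the delay-sum processing evaluated above can be shown with a small simulation. This is a sketch under stated assumptions and does not reproduce the experiment: only four channels are simulated, the integer sample delays of the 500 Hz signal and the 300 Hz noise at each microphone are hypothetical, and all function names are illustrative.

```python
import math

FS = 16000  # sampling frequency in Hz, as in the text


def delay_and_sum(channels, delays):
    """Advance channel m by delays[m] samples and average the aligned channels."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


def tone(freq_hz, n, delay):
    """Sample n of a unit sine wave that arrives delay samples late."""
    return math.sin(2.0 * math.pi * freq_hz * (n - delay) / FS)


def rms(samples):
    return math.sqrt(sum(x * x for x in samples) / len(samples))


N = 1600
target_delays = [0, 2, 4, 6]     # hypothetical arrival delays of the 500 Hz signal
noise_delays = [0, 15, 30, 45]   # hypothetical arrival delays of the 300 Hz noise
channels = [
    [tone(500, n, dt) + tone(300, n, dn) for n in range(N + max(target_delays))]
    for dt, dn in zip(target_delays, noise_delays)
]

enhanced = delay_and_sum(channels, target_delays)
target = [tone(500, n, 0) for n in range(len(enhanced))]
residual_rms = rms([e - t for e, t in zip(enhanced, target)])
noise_rms = rms([tone(300, n, 0) for n in range(N)])
```

After alignment the signal components add in phase while the noise components arrive with different phases and partly cancel, so `residual_rms` comes out much smaller than `noise_rms` for these delays. How much a given noise frequency is suppressed depends on the geometry, which is consistent with the weaker suppression reported for the 700 Hz noise.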

As described above, according to the prior art cluster-based routing, clustering has been performed on the basis of only the information of the network layer. On the other hand, in order to configure a path optimized for each signal source in an environment in which a plurality of signal sources to be sensed exist in a large-scale sensor network, a sensor node device clustering technique based on the sensing information has been necessary. Accordingly, the method of the present invention has realized a path formulation more specialized for applications by using the sensed signal information (information of the application layer) in the cluster head selection and the cluster configuration. Moreover, by combining the method with a wakeup mechanism (hardware) like the VAD processor part 52 in the microphone array network, the power consumption saving performance can be further improved.

Although the sensor network system relevant to the microphone array network system intended for acquiring a speech of a high sound quality has been described in the aforementioned preferred embodiments, the present invention is not limited to this but can be applied to sensor network systems relevant to a variety of sensors for temperature, humidity, person detection, animal detection, stress detection, optical detection, and the like.

Although the present invention has been fully described in connection with the preferred embodiments thereof with reference to the accompanying drawings, it is to be noted that various changes and modifications are apparent to those skilled in the art. Such changes and modifications are to be understood as included within the scope of the present invention as defined by the appended claims unless they depart therefrom.

1. A sensor network system comprising a plurality of node devices each having a sensor array and known position information, the node devices being connected with each other in a network via predetermined propagation paths by using a predetermined communication protocol, the sensor network system collecting data measured at each of the node devices so as to be aggregated into one base station by using a time-synchronized sensor network system, wherein each of the node devices comprises: a sensor array configured to arrange a plurality of sensors in an array form; a direction estimation processor part that operates when detecting a signal from a predetermined signal source received by the sensor array on the basis of the signal, to transmit a detected message to the base station and to estimate an arrival direction angle of the signal and transmit an angle estimation value to the base station, and is activated in response to an activation message at a time of detecting a signal received via a predetermined number of hops from other node devices to estimate an arrival direction angle of the signal and transmit an angle estimation value to the base station; and a communication processor part that performs an emphasizing process on a signal from a predetermined signal source received by the sensor array for each of the node devices belonging to a cluster designated by the base station in correspondence with the speech source, and transmits a signal that has undergone the emphasizing process to the base station, wherein the base station calculates a position of the signal source on the basis of the angle estimation value of the signal from each of the node devices and position information of each of the node devices, designates a node device located nearest to the signal source as a cluster head node device, and transmits information of the position of the signal source and the designated cluster head node device to each of the node devices, thereby clustering each of the node devices located within the number of hops from the cluster head node device as a node device belonging to each cluster, and wherein each of the node devices performs an emphasizing process on the signal from the predetermined signal source received by the sensor array for each of the node devices belonging to the cluster designated by the base station in correspondence with the speech source, and transmits the signal that has undergone the emphasizing process to the base station.
2. The sensor network system as claimed in claim 1, wherein each of the node devices is set into a sleep mode before detecting the signal and before receiving the activation message, and power supply to circuits other than a circuit that detects the signal and a circuit that receives the activation message are stopped.
3. The sensor network system as claimed in claim 1, wherein the sensor is a microphone to detect a speech.
4. A communication method for use in a sensor network system comprising a plurality of node devices each having a sensor array and known position information, the node devices being connected with each other in a network via predetermined propagation paths by using a predetermined communication protocol, the sensor network system collecting data measured at each of the node devices so as to be aggregated into one base station by using a time-synchronized sensor network system, wherein each of the node devices comprises: a sensor array configured to arrange a plurality of sensors in an array form; a direction estimation processor part that operates when detecting a signal from a predetermined signal source received by the sensor array on the basis of the signal, to transmit a detected message to the base station and to estimate an arrival direction angle of the signal and transmit an angle estimation value to the base station and is activated in response to an activation message at a time of detecting a signal received via a predetermined number of hops from other node devices to estimate an arrival direction angle of the signal and transmit an angle estimation value to the base station; and a communication processor part that performs an emphasizing process on a signal from a predetermined signal source received by the sensor array for each of the node devices belonging to a cluster designated by the base station in correspondence with the speech source, and transmits a signal that has undergone the emphasizing process to the base station, and wherein the communication method including the following steps: calculating by the base station a position of the signal source on the basis of the angle estimation value of the signal from each of the node devices and position information of each of the node devices, designating a node device located nearest to the signal source as a cluster head node device, and transmitting information of the position of the signal source and the designated cluster head node device to each of the node devices, thereby clustering each of the node devices located within the number of hops from the cluster head node device as a node device belonging to each cluster, and performing an emphasizing process by each of the node devices on the signal from the predetermined signal source received by the sensor array for each of the node devices belonging to the cluster designated by the base station in correspondence with the speech source, and transmitting the signal that has undergone the emphasizing process to the base station.
5. The communication method as claimed in claim 4, further including a step of setting each of the node devices into a sleep mode before detecting the signal and before receiving the activation message, and stopping power supply to circuits other than a circuit that detects the signal and a circuit that receives the activation message.
6. The communication method as claimed in claim 4, wherein the sensor is a microphone to detect a speech.